December 17th, 2015

Bio

  • B.A. in Math & Economics, Saint John’s University (06-10)
  • Data Analyst, College Enrollment & Financial Aid Consulting Firm (10-11)
  • M.S. in Statistics, Iowa State University (11-13)
    • Thesis: Tools for Collecting and Analyzing MLB PITCHf/x Data (pitchRx).
  • Research Intern, Statistics Research Department at AT&T Labs (Summer '13)
    • Worked with Dr. Kenny Shirley on LDAvis and LDAtools.
  • Took PhD courses and passed written qualifying exam (13-14)
  • Student, Google Summer of Code (Summer '14)
    • Began work on animint
  • Teaching Assistant, Iowa State University (11-15)
  • Mentor, Google Summer of Code (Summer '15)
  • Software Developer, plotly (Summer '15 - Present)
  • Research Assistant, Monash University (Sept '15 - Present)

Proposal Overview

  • The importance of interface design
  • Interfaces for working with web content
  • Interfaces for acquiring data on the web
  • Dynamic interactive statistical web graphics
    • Why interactive?
    • Indirect versus direct manipulation
    • Linked views and pipelines
    • Web graphics
    • Translating R graphics to the web
    • R interfaces for interactive web graphics

Motivation

  • Why interactive & dynamic graphics? They help us:
    • Find high-dimensional, abstract relationships in data that may otherwise go unnoticed
    • Diagnose models by plotting them in the data space (Wickham, Cook, & Hofmann 2015)
    • Explore and understand complicated statistical model fit(s)
    • Communicate/share our work with others in a compelling way
  • Why web-based?
    • simple to share, portable (web browser)
    • encourages composability
    • guide your audience by providing links to interesting selections/states
  • Motivation by example:
    • Sievert & Shirley (2014) develop an interactive web-based visualization for interpreting LDA topic models.

Topic models and LDA

  • Topic models are a collection of statistical models with the common goal of finding hidden structure in a collection of text documents.
  • Basic example: Given a document discussing 'sports', you're more likely to see the word 'baseball' in that document compared to a document discussing 'music' (and vice-versa for 'guitar').
  • Documents usually don't have a clear "topic", but we can develop models with latent random variables to "discover topics".
  • Latent Dirichlet Allocation (LDA) is a topic model which allows documents to be mixtures of topics (Blei, Ng, & Jordan, 2003).

The Generative Model

  1. Choose # of topics \(K\). Let \(V\) be # of unique words (vocabulary).
  2. For each document \(d\), draw \(\theta_d \sim Dir(\alpha)\) (length \(K\))
  3. For each topic \(k\), draw \(\phi_k \sim Dir(\beta)\) (length \(V\))
  4. Let \(N_d\) be # of words in doc \(d\) and \(n \in \{1, \dots, N_d\}\). For each word \(w_{d, n}\):
    • Draw a (latent) topic, \(z_{d, n} \sim Mult(1, \theta_d)\)
    • Draw a word given topic, \(w_{d, n} \sim Mult(1, \phi_{z_{d, n}})\)
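
The generative steps above can be sketched as a small base R simulation. The corpus sizes, the hyperparameter values, and the helper `rdirichlet1()` are assumptions made for illustration; they do not come from the talk.

```r
set.seed(1)

# Hypothetical helper: one Dirichlet draw via normalized gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

K <- 3                       # step 1: number of topics
V <- 8                       # step 1: vocabulary size
D <- 5                       # number of documents (assumed)
alpha <- rep(0.5, K)         # assumed symmetric prior on topic proportions
beta  <- rep(0.2, V)         # assumed symmetric prior on word probabilities
N <- rpois(D, lambda = 20) + 1   # words per document, at least one each

# Step 2: one row of topic proportions per document (D x K).
theta <- t(sapply(seq_len(D), function(d) rdirichlet1(alpha)))
# Step 3: one row of word probabilities per topic (K x V).
phi <- t(sapply(seq_len(K), function(k) rdirichlet1(beta)))

# Step 4: for every token, draw a latent topic, then a word given that topic.
docs <- lapply(seq_len(D), function(d) {
  z <- sample.int(K, size = N[d], replace = TRUE, prob = theta[d, ])
  w <- vapply(z, function(k) sample.int(V, size = 1, prob = phi[k, ]), integer(1))
  data.frame(doc = d, topic = z, word = w)
})
corpus <- do.call(rbind, docs)
head(corpus)
```

In practice only the `doc` and `word` columns are observed; the `topic` column is the latent structure that model fitting tries to recover.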

Model fitting

  • Griffiths & Steyvers (2004) derive a collapsed Gibbs sampler. Implemented in R packages LDAtools (Shirley & Sievert, 2013) and lda (Chang, 2015).
  • A wide array of fitting algorithms is available in topicmodels (Grün & Hornik, 2011) and mallet (Mimno, 2013).
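
To make the collapsed Gibbs sampler concrete, here is a minimal base R sketch under stated assumptions: symmetric scalar priors, a tiny made-up corpus, and a hypothetical function name `gibbs_lda()`. It is not the code used in LDAtools or lda, and it ignores burn-in, convergence checks, and efficiency.

```r
set.seed(42)

# `doc` and `word` hold one integer entry per token (document id, word id).
gibbs_lda <- function(doc, word, K, V, alpha = 0.1, beta = 0.01, n_iter = 100) {
  D <- max(doc)
  N <- length(word)
  z <- sample.int(K, N, replace = TRUE)   # random initial topic per token
  n_dk <- matrix(0, D, K)                 # tokens in doc d assigned to topic k
  n_kw <- matrix(0, K, V)                 # times word w is assigned to topic k
  for (i in seq_len(N)) {
    n_dk[doc[i], z[i]] <- n_dk[doc[i], z[i]] + 1
    n_kw[z[i], word[i]] <- n_kw[z[i], word[i]] + 1
  }
  for (iter in seq_len(n_iter)) {
    for (i in seq_len(N)) {
      d <- doc[i]
      w <- word[i]
      # Remove token i from the counts before resampling its topic.
      n_dk[d, z[i]] <- n_dk[d, z[i]] - 1
      n_kw[z[i], w] <- n_kw[z[i], w] - 1
      # Full conditional of z_i given all other assignments (unnormalized).
      p <- (n_dk[d, ] + alpha) * (n_kw[, w] + beta) / (rowSums(n_kw) + V * beta)
      z[i] <- sample.int(K, 1, prob = p)
      # Add the token back under its new topic.
      n_dk[d, z[i]] <- n_dk[d, z[i]] + 1
      n_kw[z[i], w] <- n_kw[z[i], w] + 1
    }
  }
  list(z = z, n_dk = n_dk, n_kw = n_kw)
}

# Toy corpus: 3 documents, 4 tokens each, vocabulary of 4 words.
doc  <- rep(1:3, each = 4)
word <- c(1, 2, 1, 2,  3, 4, 3, 4,  1, 2, 3, 4)
alpha <- 0.1
beta  <- 0.01
fit <- gibbs_lda(doc, word, K = 2, V = 4, alpha = alpha, beta = beta)

# Point estimates of theta and phi from the final state of the chain.
theta_hat <- (fit$n_dk + alpha) / rowSums(fit$n_dk + alpha)
phi_hat   <- (fit$n_kw + beta) / rowSums(fit$n_kw + beta)
```

The sampler is "collapsed" because \(\theta\) and \(\phi\) are integrated out; only the topic assignments \(z\) are sampled, and the two count matrices are enough to recover estimates afterwards.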

Model Output

  • In the digital humanities (& elsewhere), LDA is often used to "discover topics" in a large collection of text documents.
  • How are researchers supposed to interpret topics? We can't possibly examine each pmf.
  • "Overview first, then zoom & filter, then detail on demand" (Shneiderman, 1996)

Towards topic interpretation

  • Numerous interactive systems allow users to select a topic \(z\), then list top ~30 words based on \(p(w | z)\) (Gardner et al., 2010; Chaney and Blei, 2012; Snyder et al., 2013).
  • But, words likely to occur overall are also likely to occur for a given topic!
  • Taddy (2011) proposed to rank terms by \(lift = p(w | z)/p(w)\)
  • But if \(p(w)\) is small, \(lift\) is large, so rare (and noisily estimated) words dominate the ranking!
  • Bischof and Airoldi (2012) propose a new model that directly estimates each word's average frequency and its exclusivity to a given topic.
  • Sievert & Shirley (2014) propose choosing \(0 < \lambda < 1\), for: \[ \text{relevance}(\lambda) = \lambda * p(w|z) + (1 - \lambda) * \text{lift} \]
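
A small worked example may help show how \(\lambda\) trades off probability against lift. The sketch follows the formula as written on this slide (the published LDAvis definition puts both terms on the log scale), and the toy matrix `phi`, the topic weights, and the function names are hypothetical. The endpoints \(\lambda = 0\) and \(\lambda = 1\) are allowed here only to show the two extreme rankings.

```r
# Toy topic-word probabilities p(w | z): 2 topics, 4 words (made-up numbers).
phi <- rbind(
  c(0.50, 0.30, 0.15, 0.05),
  c(0.50, 0.05, 0.05, 0.40)
)
colnames(phi) <- c("the", "baseball", "pitch", "guitar")

topic_weight <- c(0.5, 0.5)              # assumed p(z)
p_w <- as.vector(topic_weight %*% phi)   # marginal p(w)
names(p_w) <- colnames(phi)

# relevance(lambda) = lambda * p(w | z) + (1 - lambda) * lift
relevance <- function(phi_k, p_w, lambda) {
  stopifnot(lambda >= 0, lambda <= 1)
  lift <- phi_k / p_w
  lambda * phi_k + (1 - lambda) * lift
}

# Terms for topic k, ordered from most to least relevant.
rank_terms <- function(k, lambda) {
  names(sort(relevance(phi[k, ], p_w, lambda), decreasing = TRUE))
}

rank_terms(1, lambda = 1)   # pure p(w | z): the common word "the" comes first
rank_terms(1, lambda = 0)   # pure lift: the topic-specific "baseball" comes first
```

Intermediate values of \(\lambda\) blend the two rankings, which is the knob the interactive visualization exposes to the user.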

Who is using it?

  • People who use LDA and want a tool for interpreting topics.
  • Combined, LDAvis and pyLDAvis currently have 356 stars on GitHub (a measure of popularity).
  • I know a number of consultants, industry workers, and educators using it for exploration, presentation, and teaching. Here are a few videos.
  • Dr. Grant Arndt in the Department of Anthropology at Iowa State University (and his research assistant) are using it as a research aid.

Why is this important?

  • We're enabling analysts to gain insight from sophisticated statistical models, communicate their results, and teach others.

We need better tools

  • Producing interactive and dynamic web graphics from "scratch" (i.e., using HTML/JavaScript/CSS/SVG/d3js) is very powerful and flexible, but time-consuming.
  • People doing data analysis & statistics don't have the time to learn all these tools. In general, how do we best enable them to create their own interactive dynamic web graphics?
  • I've worked on two R packages in this direction: animint and plotly.
  • Both can translate ggplot2 (Wickham, 2009) graphics to a web-based format (SVG/canvas) and add some basic interactive features.
  • ggplot2 is wildly successful thanks to its implementation of a "Grammar of Graphics" (Wilkinson, 1999) which makes it easy to map data to visual displays.
  • animint extends this grammar in a novel direction to enable a constrained form of "linked views" (Hocking et al., 2015).

library(ggplot2)
p <- qplot(data = iris, x = Sepal.Width, y = Sepal.Length, color = Species)
p

library(plotly)
ggplotly(p)

library(animint)
structure(list(plot = p), class = "animint")

Translating R graphics to the web

  • Pros:
    • Easy to use: builds on existing knowledge/code
    • Doesn't require a web server running special software
  • Cons:
    • Translation may depend on internals of other packages
    • To change something that's serialized, you need to re-run R code
    • Hard to extend, customize, and/or add (interactive) features
  • Although translation is pragmatic, a truly interactive web graphics tool needs a custom interface/language designed for that purpose.
  • Many relevant R packages provide bindings to JavaScript libraries through a JSON specification (e.g., ggvis (Chang & Wickham, 2015), rbokeh (Hafen & Bokeh team, 2015), plotly (Sievert & Plotly team, 2015))

R Bindings to JavaScript Libraries

  • General idea:
    • Start with an HTML/JS/CSS template
    • Abstract away data and layout/appearance options
    • Map a set of R objects to template
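myWrapper <- function(...) {
  # compute stuff (e.g., validate inputs and fill in defaults)
  # then serialize the named R objects to JSON for the JavaScript template
  # (toJSON is assumed here to come from the jsonlite package)
  jsonlite::toJSON(list(...), auto_unbox = TRUE)
}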
  • The R package htmlwidgets makes it easy for authors to write bindings that play nicely with shiny/rmarkdown/RStudio.
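
The general idea can be sketched end to end in base R without any JavaScript library. Everything in this block is hypothetical: the template, the helper `to_json_array()`, and the function `scatter_html()` are illustrations, not part of htmlwidgets or any existing package, and the hand-rolled serializer handles only plain numeric vectors and unescaped titles.

```r
# Step 1: an HTML/JS template with two placeholders (element id and JSON spec).
template <- '<div id="%s"></div>
<script>
  var spec = %s;
  // a real binding would pass `spec` to a JavaScript plotting library here
</script>'

# Minimal serializer for the sketch: numeric vector -> JSON array.
to_json_array <- function(x) {
  paste0("[", paste(x, collapse = ","), "]")
}

# Steps 2 and 3: data and appearance options become function arguments,
# which are mapped onto the template.
scatter_html <- function(x, y, title = "", id = "chart") {
  stopifnot(is.numeric(x), is.numeric(y), length(x) == length(y))
  spec <- sprintf('{"x":%s,"y":%s,"title":"%s"}',
                  to_json_array(x), to_json_array(y), title)
  sprintf(template, id, spec)
}

html <- scatter_html(c(1, 2, 3), c(2, 4, 8), title = "demo")
cat(html)
```

A production binding would delegate serialization to a JSON package and packaging of the JavaScript dependencies to htmlwidgets rather than pasting strings by hand.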

library(plotly)
plot_ly(economics, x = date, y = unemploy / pop)

p <- plot_ly(economics, x = date, y = unemploy / pop)
str(p) 
#> Classes ‘plotly’ and 'data.frame':   478 obs. of  6 variables:
#>  $ date    : Date, format: "1967-06-30" "1967-07-31" ...
#>  $ pce     : num  508 511 517 513 518 ...
#>  $ pop     : int  198712 198911 199113 199311 199498 199657 199808 199920 200056 200208 ...
#>  $ psavert : num  9.8 9.8 9 9.8 9.7 9.4 9 9.5 8.9 9.6 ...
#>  $ uempmed : num  4.5 4.7 4.6 4.9 4.7 4.8 5.1 4.5 4.1 4.6 ...
#>  $ unemploy: int  2944 2945 2958 3143 3066 3018 2878 3001 2877 2709 ...
#>  - attr(*, "plotly_hash")= chr "f638d391dcf53809b8426325a842a091#8"

str(plotly_build(p))
#> List of 2
#>  $ data  :List of 1
#>   ..$ :List of 5
#>   .. ..$ type   : chr "scatter"
#>   .. ..$ x      : Date[1:478], format: "1967-06-30" ...
#>   .. ..$ y      : num [1:478] 0.0148 0.0148 0.0149 0.0158 0.0154 ...
#>  $ layout:List of 2
#>   ..$ xaxis:List of 1
#>   .. ..$ title: chr "date"
#>   ..$ yaxis:List of 1
#>   .. ..$ title: chr "unemploy/pop"

Pure functional programming

  • A function is pure if:
    • Output depends solely on input(s).
    • It has no side effects (e.g., library(), options())
dim(economics)
#> [1] 478   6
e <- transform(economics, rate = unemploy / pop)
dim(e)
#> [1] 478   7
  • Pure functions are easy to understand in isolation
    • no searching for "lingering" variables that may affect output
  • Modern R packages designed to make data analysis easier use this principle: dplyr (Wickham, 2015), tidyr (Wickham, 2015), broom (Robinson, 2015), ggvis (Chang & Wickham, 2015), rvest (Wickham, 2015).
  • Donoho (2015) states: "This effort may have more impact on today’s practice of data analysis than many highly-regarded theoretical statistics papers."
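
To contrast the two definitions, here is a small sketch of an impure and a pure version of the same computation. The function names and the `demo.scale` option are made up for illustration.

```r
counter <- 0

# Impure: the output depends on a global option (hidden input), and each
# call modifies a variable outside the function (side effect).
scale_impure <- function(x) {
  counter <<- counter + 1
  x * getOption("demo.scale", default = 1)
}

# Pure: everything the result depends on is an explicit argument.
scale_pure <- function(x, scale = 1) {
  x * scale
}

a <- scale_impure(1:3)
old <- options(demo.scale = 10)   # change hidden state
b <- scale_impure(1:3)            # same call, different answer
options(old)                      # restore the previous option value

identical(a, b)                                      # FALSE
identical(scale_pure(1:3, 10), scale_pure(1:3, 10))  # TRUE
counter                                              # 2
```

The pure version can be understood and tested from its signature alone, which is what makes long chains of such functions easy to reason about.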

magrittr's %>% helps us chain together a sequence of pure functions to elegantly express complex tasks.

# f(x, y) becomes x %>% f(y)
economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate)

economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate, name = "raw") %>%
  loess(rate ~ as.numeric(date), data = .) %>%
  broom::augment() %>%
  add_trace(y = .fitted, name = "smooth")

economics %>%
  transform(rate = unemploy / pop) %>%
  plot_ly(x = date, y = rate, name = "raw") %>%
  subset(rate == max(rate)) %>%
  layout(annotations = list(x = date, y = rate, text = "Peak", showarrow = TRUE),
         title = "The U.S. Unemployment Rate")

Enabling coordinated, linked views

  • Coordinated, linked views are an important feature of any interactive statistical graphics system (e.g., cranvas, ggobi, iplots, mondrian, MANET).
  • In order to have linked views, we need a "data pipeline" (Buja et al., 1988; Wickham et al., 2010).
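
One way to picture a data pipeline for linked views is a shared selection that notifies every registered view when it changes. The sketch below is a toy illustration in base R with hypothetical names (`make_pipeline()`, `on_change`, `select`); it is not how cranvas, ggobi, animint, or plotly implement linking, and the "views" only record what they would draw.

```r
# Shared pipeline: holds the data, the current selection, and the listeners.
make_pipeline <- function(data) {
  selection <- rep(FALSE, nrow(data))
  listeners <- list()
  list(
    # Register a view: a function of (data, selection) that redraws itself.
    on_change = function(f) {
      listeners[[length(listeners) + 1]] <<- f
      invisible(NULL)
    },
    # Update the selection and push it through to every registered view.
    select = function(rows) {
      selection <<- seq_len(nrow(data)) %in% rows
      for (f in listeners) f(data, selection)
      invisible(NULL)
    }
  )
}

state <- new.env()   # stands in for what each view currently displays
pipe <- make_pipeline(iris)

# View 1: a scatterplot that highlights the selected points.
pipe$on_change(function(data, sel) {
  state$scatter_colors <- ifelse(sel, "red", "grey")
})
# View 2: a bar chart of species counts among the selected points.
pipe$on_change(function(data, sel) {
  state$bar_counts <- table(data$Species[sel])
})

# A brush in one view is a single call; both views update in response.
pipe$select(which(iris$Sepal.Length > 7))
state$bar_counts
```

The point of the design is that views never talk to each other directly; they depend only on the shared pipeline, so adding a third linked view is one more `on_change()` call.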

Timeline

  • December: Revise and resubmit book chapter on MLB Pitching Expertise and Evaluation for the Handbook of Statistical Methods for Design and Analysis in Sports, a volume that is planned to be one of the Chapman & Hall/CRC Handbooks of Modern Statistical Methods.
  • January: Revise and submit animint paper.
  • February: More support for linked views in plotly.
  • April: Write and submit curating data paper.
  • June: Write and submit interactive web graphics paper.
  • August: Thesis defense.

Thanks to my collaborators

  • LDAvis (Kenny Shirley)
  • animint (Toby Dylan Hocking, Susan VanderPlas, Kevin Ferris, and Tony Tsai)
  • plotly (Toby Dylan Hocking and the Plotly Team)